Sr. Big Data Engineer Resume
San Francisco, CA
SUMMARY
- 9+ years of professional experience in IT, including Analysis, Design, Coding, Testing, Implementation and Training in Java and Big Data technologies, working with Apache Hadoop ecosystem components, Spark Streaming and Amazon Web Services (AWS).
- Progressive experience in all phases of the iterative Software Development Life Cycle (SDLC)/Agile methodology.
- Actively involved in Requirements Gathering, Analysis, Development, Unit Testing and Integration Testing.
- Extensive experience with Hadoop ecosystem components like Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper and Flume.
- In-depth understanding/knowledge of Hadoop architecture and its components, such as HDFS, YARN, MapReduce, Pig, Hive, HBase, ZooKeeper, Oozie and Flume.
- Hands-on experience with AWS components like EC2, EMR, S3 and Elasticsearch.
- Expertise in writing Hadoop jobs for analyzing data using Spark, Hive, Pig and MapReduce.
- Good understanding of HDFS design, daemons and HDFS High Availability (HA).
- Solid experience with the Docker container service.
- Good understanding of and working experience with Hadoop distributions like Cloudera and Hortonworks.
- Well versed in open-source mapping software such as GeoServer and OpenLayers, as well as Google Maps.
- Good knowledge of creating event-processing data pipelines using Flume, Kafka and Storm.
- Expertise in data transformation and analysis using Spark, Pig and Hive.
- Built and configured Apache Tez on Hive and Pig to achieve better response times when running MapReduce jobs.
- Extensive hands-on administration experience with Hortonworks.
- Experience in importing and exporting terabytes of data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice versa.
- Extended Hive and Pig core functionality by writing custom UDFs, UDTFs and UDAFs.
- Experience in analyzing large-scale data to identify new analytics, insights, trends and relationships with a strong focus on data clustering.
- Experience with Hadoop deployment and automation tools such as Ambari, Cloudbreak, EMR.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text, Avro and Parquet files.
- Hands-on experience with Avro and Parquet file formats, dynamic partitions and bucketing for best practices and performance improvement.
- Developed Spark SQL programs for handling different data sets for better performance (see the sketch after this list).
- Good knowledge of building event-processing pipelines using Spark Streaming.
- Experience processing semi-structured data (XML, JSON and CSV) in Hive/Impala.
- Good working experience with Hadoop cluster architecture and cluster monitoring.
- In-depth understanding of Data Structure and Algorithms.
- Experience in using ZooKeeper and Oozie operational services to coordinate clusters and schedule workflows.
- Excellent understanding and knowledge of NoSQL databases like HBase and MongoDB.
- Experience in implementing standards and processes for Hadoop-based application design and implementation.
- Worked with cloud services like Amazon Web Services (AWS) and was involved in ETL, data integration and migration.
- Extensive experience working with JavaScript for client-side validations; implemented AJAX with JavaScript to reduce data transfer overhead between the user and the server.
- Extensive experience working with Oracle, SQL Server and MySQL databases. Hands-on experience in application development using Java and RDBMS.
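As a brief illustration of the Spark SQL work described above, a minimal PySpark sketch (Spark 2.x-style API; the paths, column names and partition key are hypothetical placeholders) that reads semi-structured JSON, aggregates it with Spark SQL and writes partitioned Parquet:

```python
# Minimal sketch: read semi-structured JSON, transform with Spark SQL,
# and write partitioned Parquet. Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").enableHiveSupport().getOrCreate()

# Ingest raw JSON events (schema is inferred)
events = spark.read.json("hdfs:///data/raw/events/")
events.createOrReplaceTempView("events")

# Aggregate with Spark SQL
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date, event_type
""")

# Write Parquet, partitioned by date so later reads can prune partitions
(daily.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("hdfs:///data/curated/daily_event_counts/"))

spark.stop()
```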
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, MapReduce, Spark Core, Spark Streaming, Spark SQL, Hive, Tez, Pig, Sqoop, Flume, Kafka, Oozie, NiFi, ZooKeeper, Docker
AWS Components: EC2, S3, RDS, Redshift, EMR, DynamoDB, Lambda, SNS, SQS
NoSQL Databases: HBase, Cassandra, MongoDB
Languages: C, C++, Java, Scala, J2EE, Python, PL/SQL, Pig Latin, HiveQL, UNIX shell scripts
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL, RMI, JMS, JavaScript, JSP, Servlets, EJB, JSF, jQuery
Frameworks: MVC, Struts, Spring, Hibernate
Operating Systems: Sun Solaris, HP-UNIX, RedHat Linux, Ubuntu Linux and Windows XP/Vista/7/8
Web Technologies: HTML, DHTML, XML, AJAX, WSDL, SOAP
Web/Application servers: Apache Tomcat, WebLogic, JBoss
Databases: Oracle 9i/10g/11g, DB2, SQL Server, MySQL, Teradata
Tools and IDEs: Eclipse, NetBeans, Toad, Maven, SBT, ANT, Hudson, Sonar, JDeveloper, Assent PMD, DbVisualizer
Network Protocols: TCP/IP, UDP, HTTP, DNS, DHCP
PROFESSIONAL EXPERIENCE
Confidential, San Francisco, CA
Sr. Big Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Spark.
- Wrote data transformation scripts using Hive and MapReduce (Python).
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common word2vec data model, which consumes data from Kafka in near real time and persists it into Cassandra (see the sketch after this list).
- Performed geospatial modeling, scripting and geostatistical application development for a cloud-based solution using Hadoop MapReduce.
- Operated the cluster on AWS using EC2, EMR, S3 and Elasticsearch.
- Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Migrated existing MapReduce programs to Spark using Scala and Python
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Built Spark 1.6.1 from source over YARN to match the production Cloudera (CDH 5.7) Hadoop 2.7 version.
- Worked with Hadoop distributions such as Cloudera and Hortonworks, using management tools such as Cloudera Manager and Hortonworks Ambari.
- Developed Scala scripts and UDFs using DataFrames in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Implemented proofs of concept on the Hadoop stack and various big data analytics tools, including migration from different databases.
- Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism and appropriate memory settings.
- Added and configured the ESRI JAR to execute geospatial queries in Hive and Beeline.
- Optimized existing word2vec algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs during development of a chatbot using OpenNLP and word2vec.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Performed the initial deployment of a 5-node Hortonworks distributed environment on Amazon Web Services (AWS).
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Hands-on experience with various AWS services such as Redshift clusters and Route 53 domain configuration.
- Built an on-demand secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS and Lambda functions.
- Extensive working knowledge of NiFi.
- Used AWS services like EC2 and S3 for small data sets.
- Virtualized servers using Docker for test and development environment needs, and automated configuration using Docker containers.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Handled large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
- Created Hive tables, loaded the data using Sqoop and worked on them using HiveQL.
- Responsible for developing custom UDFs, UDAFs and UDTFs in Pig and Hive.
- Optimized Hive queries using various file formats like JSON, Avro, ORC and Parquet.
- Implemented Spark SQL to connect to Hive, read the data and distribute the processing to make it highly scalable.
- Analyzed tweet JSON data using the Hive SerDe API to deserialize it and convert it into a readable format.
- Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows to run multiple Spark jobs in sequence for processing data.
- Built Tez from source and configured it on Hive, achieving very good response times (under 1 minute) for large Hive queries that previously took over 30 minutes.
- Processed application web logs using Flume and loaded them into Hive for data analysis.
- Generated different types of reports using HiveQL for the business to analyze data feeds from source systems.
- Implemented RESTful Web Services to interact with Oracle/Cassandra to store/retrieve the data.
- Generated detailed design documentation for the source-to-target transformations.
- Wrote UNIX scripts to monitor data load/transformation.
- Involved in the iteration planning process under the Agile Scrum methodology.
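A minimal sketch of the Kafka-to-Spark-Streaming-to-Cassandra flow referenced above, using the Spark 1.6-era direct Kafka API and the DataStax spark-cassandra-connector (assumed to be on the classpath); the broker address, topic, keyspace and table names are hypothetical placeholders:

```python
# Minimal sketch: consume Kafka messages with Spark Streaming and persist
# each micro-batch into Cassandra. All names below are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-to-cassandra")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches
sqlContext = SQLContext(sc)

# Direct (receiver-less) Kafka stream, Spark 1.6-style API
stream = KafkaUtils.createDirectStream(
    ssc, ["chat_messages"], {"metadata.broker.list": "broker1:9092"})

def save_batch(rdd):
    if rdd.isEmpty():
        return
    # Each element is a (key, value) pair; keep only the message payload
    rows = rdd.map(lambda kv: Row(message=kv[1]))
    df = sqlContext.createDataFrame(rows)
    # Requires the spark-cassandra-connector package on the classpath
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="analytics", table="messages")
       .mode("append")
       .save())

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()
```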
Environment: Hadoop, Cloudera Manager, Hortonworks HDP 2.0, Cloudbreak, OpenGeo Suite, Google Maps, HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Spark, Oozie, ZooKeeper, AWS, Docker, RDBMS/DB, MySQL, CSV, Avro data files.
Confidential, Carrollton, TX
Sr. Big Data Engineer
Responsibilities:
- Involved in Design and Development of technical specifications.
- Developed multiple Spark jobs in PySpark for data cleaning and preprocessing.
- Analyzed large data sets by running Hive queries and Pig scripts.
- Involved in creating Hive tables, and loading and analyzing data using Hive queries.
- Developed simple/complex MapReduce jobs using Hive and Pig.
- Worked with Hortonworks support to resolve issues.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install Operating Systems, Hadoop updates, patches, and version upgrades as required.
- Proficient in using Cloudera Manager, an end to end tool to manage Hadoop operations.
- Responsible for managing data from multiple data sources.
- Experienced in running Hadoop streaming jobs to process terabytes of XML format data.
- Optimized MapReduce algorithms using combiners and partitioners to deliver the best results, and worked on application performance optimization.
- Processed HDFS data and created external tables using Hive and developed scripts to ingest and repair tables that can be reused across the project.
- Developed merge jobs in Python to extract and load data from a MySQL database to HDFS.
- Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to load data files.
- Developed Python scripts to monitor the health of MongoDB databases and perform ad-hoc backups using mongodump and mongorestore (see the sketch after this list).
- Wrote Pig UDFs for converting date and timestamp formats in the unstructured files to the required date formats and processed them.
- Created 30 buckets for each Hive table, clustered by client ID, for better performance (optimization) while updating the tables.
- Wrote Pig Scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Involved in exporting processed data from Hadoop to relational databases or external file systems using Sqoop, HDFS get or copyToLocal.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted data from MySQL into HDFS using Sqoop.
- Expert in importing and exporting data into HDFS using Sqoop and Flume.
- Used Sqoop to migrate data back and forth between HDFS and MySQL or Oracle, and deployed Hive-HBase integration to perform OLAP operations on HBase data.
- Wrote shell scripts to pull data from the Tumbleweed server to the Cornerstone staging area.
- Closely worked with Hadoop security team and infrastructure team to implement security.
- Implemented authentication and authorization services using the Kerberos authentication protocol.
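A minimal sketch of the kind of MongoDB health-check and ad-hoc backup script described above, using pymongo and shelling out to mongodump; the host, database name and backup path are hypothetical placeholders:

```python
# Minimal sketch: check MongoDB health, then take an ad-hoc mongodump backup.
# Host, port, database name and backup path are hypothetical placeholders.
import subprocess
from datetime import datetime

from pymongo import MongoClient
from pymongo.errors import PyMongoError

MONGO_HOST = "mongo-host"
MONGO_PORT = 27017
BACKUP_ROOT = "/data/backups/mongodb"


def is_healthy(client):
    """Return True if the server responds to ping and reports ok status."""
    try:
        client.admin.command("ping")
        return client.admin.command("serverStatus").get("ok") == 1.0
    except PyMongoError:
        return False


def run_backup(database):
    """Shell out to mongodump for an ad-hoc backup of one database."""
    target = "{}/{}_{}".format(
        BACKUP_ROOT, database, datetime.now().strftime("%Y%m%d%H%M"))
    subprocess.check_call([
        "mongodump", "--host", MONGO_HOST, "--port", str(MONGO_PORT),
        "--db", database, "--out", target])
    return target


if __name__ == "__main__":
    client = MongoClient(MONGO_HOST, MONGO_PORT, serverSelectionTimeoutMS=5000)
    if is_healthy(client):
        print("Backup written to {}".format(run_backup("customer_profiles")))
    else:
        print("MongoDB health check failed; skipping backup")
```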
Environment: Hadoop, Cloudera, Hortonworks HDP 2.0, OpenGeo Suite, Google Maps, MapReduce, Hive, Pig, Spring Batch, Scala, Sqoop, Bash scripting, Spark RDD, Spark SQL.
Confidential, Bloomington, IL
Hadoop Developer
Responsibilities:
- Gathered data from multiple sources such as Teradata, Oracle and SQL Server using Sqoop and loaded it into HDFS (see the sketch after this list).
- Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Responsible for cleansing and validating data.
- Responsible for writing MapReduce jobs that join the incoming slices of data and pick only the fields needed for further processing.
- Found the right join conditions and created datasets conducive to data analysis.
- Involved in loading data from the UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Devised procedures that solve complex business problems with due considerations for hardware/software capacity and limitations, operating times and desired results.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Provided quick response to ad hoc internal and external client requests for data and experienced in creating ad hoc reports.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
- Worked hands-on with the ETL process.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Extracted the data from Teradata into HDFS using Sqoop.
- Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior such as shopping enthusiasts, travelers, music lovers, etc.
- Wrote REST Web services to expose the business methods to external services.
- Exported the patterns analyzed back into Teradata using Sqoop.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Developed Hive queries to process the data and generate data cubes for visualization.
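A minimal sketch of a Sqoop import of a Teradata table into HDFS, wrapped in a small Python driver (assuming the Teradata JDBC driver is available to Sqoop); the JDBC URL, credentials file, table name and target directory are hypothetical placeholders:

```python
# Minimal sketch: drive a Sqoop import of a Teradata table into HDFS.
# JDBC URL, credential file, table name and target directory are placeholders.
import subprocess

SQOOP_IMPORT = [
    "sqoop", "import",
    "--connect", "jdbc:teradata://teradata-host/DATABASE=sales_db",
    "--username", "etl_user",
    "--password-file", "/user/etl_user/.sqoop.password",
    "--table", "CUSTOMER_TXN",
    "--target-dir", "/data/raw/customer_txn",
    "--num-mappers", "8",
    "--fields-terminated-by", "\t",
]

if __name__ == "__main__":
    # Raises CalledProcessError if the Sqoop job fails, so the wrapper
    # can be scheduled and monitored like any other batch step.
    subprocess.check_call(SQOOP_IMPORT)
```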
Environment: Hadoop, MapReduce, HDFS, Hive, Flume, Sqoop, Cloudera, Oozie, UNIX.
Confidential, Pittsburgh, PA
Hadoop Developer
Responsibilities:
- Responsible for loading the customer data and event logs from an Oracle database and Teradata into HDFS using Sqoop.
- Performed end-to-end performance tuning of Hadoop clusters and Hadoop MapReduce routines against very large data sets.
- Responsible for building scalable distributed data solutions using Hadoop.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted the data from MySQL into HDFS using Sqoop.
- Implemented MapReduce programs to analyze large datasets in the warehouse for business intelligence.
- Wrote Storm spouts and bolts to collect real-time streaming customer data from the Kafka broker, process it and store it in HBase.
- Analyzed log files and processed them through Flume.
- Optimized MapReduce algorithms using combiners and partitioners to deliver the best results, and worked on application performance optimization.
- Developed HQL queries to implement select, insert, update and delete operations on the database.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Developed simple to complex Map/Reduce jobs using Java, and scripts using Hive and Pig.
- Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) for data ingestion
- Implemented business logic by writing UDFs in Java and used various UDFs from other sources.
- Experienced in loading and transforming large sets of structured and semi-structured data.
- Managed and reviewed Hadoop log files, and deployed and maintained the Hadoop cluster.
- Exported filtered data into HBase for fast querying.
- Involved in creating Hive tables, loading with data and writing Hive queries.
- Created data models for customer data using the Cassandra Query Language (see the sketch after this list).
- Ran many performance tests using the cassandra-stress tool to measure and improve read and write performance.
- Involved in developing shell scripts to orchestrate the execution of all other scripts (Pig, Hive and MapReduce) and to move data files within and outside of HDFS.
- Queried and analyzed data from DataStax Cassandra for quick searching, sorting and grouping.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
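A minimal sketch of the kind of CQL customer data model referenced above, created and queried through the DataStax Python driver; the keyspace, table and column names are hypothetical placeholders:

```python
# Minimal sketch: define and query a customer event table in Cassandra (CQL)
# via the DataStax Python driver. Keyspace/table/column names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS customer_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("customer_ks")

# Partition by customer, cluster by event time so a customer's recent
# events can be read with a single ordered partition scan.
session.execute("""
    CREATE TABLE IF NOT EXISTS customer_events (
        customer_id text,
        event_time  timestamp,
        event_type  text,
        payload     text,
        PRIMARY KEY (customer_id, event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# Fetch the latest events for one customer
rows = session.execute(
    "SELECT event_time, event_type FROM customer_events "
    "WHERE customer_id = %s LIMIT 10", ("cust-42",))
for row in rows:
    print(row.event_time, row.event_type)

cluster.shutdown()
```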
Environment: Apache Hadoop (Cloudera), HBase, Hive, Pig, MapReduce, Sqoop, Oozie, Eclipse, Java.