Sr. Data Engineer Resume
Bethesda, MD
SUMMARY
- 8+ years of professional IT experience across a variety of industries, including hands-on experience with the Big Data ecosystem in the ingestion, storage, querying, processing and analysis of big data.
- Experience with different Hadoop distributions such as Cloudera (CDH), Hortonworks (HDP) and Amazon Elastic MapReduce (EMR).
- Hands-on experience developing predictive models using machine learning.
- Implemented various machine learning techniques such as random forest, k-means and logistic regression for prediction and pattern identification using Spark MLlib.
- Performed linear regression using the Spark Scala API.
- Responsible for building Hadoop clusters with the Hortonworks/Cloudera distributions and integrating them with the Pentaho Data Integration (PDI) server.
- Worked extensively with various machine learning algorithms and used NLTK, a natural language processing (NLP) library for Python, to build models.
- Good knowledge of deep learning, neural networks and convolutional neural networks (CNNs).
- Extensive experience with Spark/Scala, PySpark, MapReduce (MRv1) and MapReduce v2 (YARN).
- Designed and created data ingestion pipelines using technologies such as Apache Kafka.
- Used Kafka to load data into HDFS and move data into NoSQL databases.
- Hands-on experience configuring and working with Flume to load data from multiple sources directly into HDFS.
- Performed data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS and data APIs.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala (a brief sketch follows this summary).
- Experience working with Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQLContext.
- Converted MapReduce jobs into transformations and actions using Spark RDDs and Spark DataFrames.
- Experience creating Pig and Hive UDFs to analyze data efficiently.
- Hands-on experience designing, reviewing, implementing and optimizing data transformation processes in the Hadoop and Talend/Informatica ecosystems.
- Experience with SequenceFile, Avro, ORC and Parquet file formats and gzip, Snappy and bz2 compression codecs.
- Good experience designing and creating data ingestion pipelines using technologies such as Apache Storm and Kafka.
- Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
- Experience in importing and exporting data using Sqoop from HDFS to relational database systems and vice-versa.
- Strong experience in working with Elastic MapReduce and setting up environments on Amazon AWS EC2 instances.
- Hands on NoSQL database experience with HBase and Cassandra.
- Installed Solr and configured Solr indexing of near-real-time data.
- Experience using CQL to execute queries on data persisting in the Cassandra cluster.
- Involved in processing of data using Apache Tez and storing it to Cassandra.
- Extensively worked on MongoDB concepts like locking, transactions, indexes, sharding, replication and schema design.
- Extracted files from MongoDB using Sqoop, placed them in HDFS and processed them.
- Developed core search components using Apache Solr.
- In-depth understanding of the Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode and MapReduce concepts.
- Experience configuring Hadoop ecosystem components: Hive, HBase, Pig, Sqoop, Mahout and ZooKeeper.
- Knowledge of job workflow scheduling and monitoring tools such as Oozie and ZooKeeper.
- Experience building and maintaining multiple Hadoop clusters (prod, dev, etc.) of different sizes and configurations, and setting up rack topology for large clusters.
- Loaded data from different source databases and files into Hive using Talend.
- Experience creating reports and building dashboards using Tableau.
- Created views in Tableau Desktop that were published to the internal team for review, further data analysis and customization using filters and actions.
- Experience optimizing MapReduce jobs using combiners and partitioners to deliver the best results.
- Proficient in using Cloudera Manager, an end-to-end tool for managing Hadoop operations.
- Worked with BI teams to generate reports and design ETL workflows in Tableau.
- Followed test-driven development under Agile, Waterfall and RUP methodologies to produce high-quality software.
- Expertise in developing distributed business applications using EJB, implementing session beans for business logic, entity beans for persistence logic and message-driven beans for asynchronous communication.
- Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Installed and configured the Hortonworks Distribution (HDP) and worked on the full SDLC following an agile methodology.
- Hands-on experience developing applications with Java, J2EE (Servlets, JSP, EJB), SOAP web services, JNDI, JMS, JDBC, Hibernate, Struts, Spring, XML, HTML, XSD, XSLT, PL/SQL, Oracle 10g and MS SQL Server RDBMS.
- Experience in database design, entity relationships, database analysis, SQL programming, PL/SQL stored procedures, packages and triggers in Oracle and SQL Server on Windows and UNIX.
- Worked on different operating systems including UNIX/Linux, Windows NT, Windows XP and Windows 2000.
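The Hive-to-Spark conversion referenced above can be illustrated with a minimal sketch using the Spark Java API with Hive support; the table and column names (web_logs, status, url) and the aggregation are hypothetical placeholders rather than details from any actual engagement.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.col;

    public class HiveToSparkSketch {
        public static void main(String[] args) {
            // SparkSession with Hive support so existing Hive tables are visible.
            SparkSession spark = SparkSession.builder()
                    .appName("hive-to-spark-sketch")
                    .enableHiveSupport()
                    .getOrCreate();

            // Original Hive query run as-is through Spark SQL.
            Dataset<Row> viaSql = spark.sql(
                    "SELECT url, COUNT(*) AS hits FROM web_logs WHERE status = 200 GROUP BY url");

            // Equivalent logic expressed as DataFrame transformations.
            Dataset<Row> viaDataFrame = spark.table("web_logs")
                    .filter(col("status").equalTo(200))
                    .groupBy(col("url"))
                    .count()                              // adds a "count" column per group
                    .withColumnRenamed("count", "hits");

            viaSql.show(10);
            viaDataFrame.show(10);
            spark.stop();
        }
    }

Running the same logic both through spark.sql and as DataFrame transformations is a common way to validate a migrated query before retiring the original Hive version.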
TECHNICAL SKILLS
Big Data: Cloudera Distribution, HDFS, ZooKeeper, YARN, DataNode, NameNode, ResourceManager, NodeManager, MapReduce, Pig, Sqoop, HBase, Hive, Flume, Cassandra, MongoDB, Oozie, Kafka, Spark, Storm, Scala, Impala
Operating System: Windows, Linux, Unix.
Languages: Java, J2EE, SQL, Python, Scala
Databases: IBM DB2, Oracle, SQL Server, MySQL, PostgreSQL
Web Technologies: JSP, Servlets, HTML, CSS, JDBC, SOAP, XSLT.
Version Tools: Git, SVN, CVS
IDE: IBM RAD, Eclipse, IntelliJ
Tools: TOAD, SQL Developer, ANT, Log4J
Web Services: WSDL, SOAP.
ETL: Talend ETL, Talend Studio
Web/App Server: UNIX server, Apache Tomcat
PROFESSIONAL EXPERIENCE
Confidential, Bethesda, MD
Sr. Data Engineer
Responsibilities:
- Primary responsibilities include building scalable distributed data solutions using the Hadoop ecosystem.
- Worked on the Hortonworks Data Platform (HDP) Hadoop distribution, using Hive to store, retrieve and query data.
- Implemented Hive optimized joins to gather data from different sources and run ad-hoc queries on them.
- Performed custom aggregate functions using Spark SQL and performed interactive querying.
- Coordinated with Hortonworks, the development team and the operations team on platform-level issues.
- Worked extensively on combiners, partitioning and the distributed cache to improve the performance of MapReduce jobs.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQLContext.
- Used Sqoop to transfer data between databases and HDFS, and used Kafka to stream log data from servers.
- Used Pig to perform data transformations, event joins, filter and some pre-aggregations before storing the data onto HDFS.
- Implemented different analytical algorithms using MapReduce programs to apply on top of HDFS data.
- Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication and schema design.
- Implemented read preferences in a MongoDB replica set.
- Used Apache Tez for processing data and storing it in MongoDB.
- Familiar with MongoDB write concern to avoid loss of data during system failures.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Extensively performed CRUD operations like put, get, scan, delete, update etc., on HBase database.
- Wrote Hive generic UDFs to perform business-logic operations at the table level (a representative sketch appears at the end of this section).
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and preprocessing with Pig, Hive, Sqoop.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Used Hive join queries to join multiple tables of a source system and load the results into Elasticsearch.
- Used Apache Kafka as a messaging system to load log data and application data into HDFS.
- Developed a POC in Scala, deployed it on the YARN cluster and compared the performance of Spark with Hive and SQL.
- Involved in converting Hive queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked with various file formats (Text, Avro, Parquet) and compression codecs (Snappy, bz2, gzip).
- Implemented test scripts to support test driven development and continuous integration.
- Scheduled cron jobs for filesystem checks using fsck and wrote shell scripts to generate alerts.
- Performed data scrubbing and processing with Oozie.
- Loaded the analyzed Hive data into NoSQL databases such as HBase and MongoDB.
- Provided technical support for the Research in Information Technology program.
- Managed and upgraded Linux and OS X server systems.
- Responsible for the installation, configuration and management of Linux systems.
Environment: Hadoop, Java, MapReduce, HDFS, Hive, Pig, Sqoop, Flume, Python, Spark, Impala, Scala, Kafka, Shell Scripting, Eclipse, Cloudera, MySQL, Talend, Cassandra
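A minimal sketch of a Hive generic UDF of the kind mentioned above; the masking rule and the MaskValueUDF/mask_value names are illustrative assumptions, not the actual business logic.

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.io.Text;

    /** Illustrative generic UDF: mask_value(str) keeps the first character and hides the rest. */
    public class MaskValueUDF extends GenericUDF {

        @Override
        public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
            if (arguments.length != 1) {
                throw new UDFArgumentLengthException("mask_value expects exactly one argument");
            }
            // The UDF always returns a string.
            return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
        }

        @Override
        public Object evaluate(DeferredObject[] arguments) throws HiveException {
            Object value = arguments[0].get();
            if (value == null) {
                return null;
            }
            String s = value.toString();
            String masked = s.isEmpty() ? s : s.charAt(0) + s.substring(1).replaceAll(".", "*");
            return new Text(masked);
        }

        @Override
        public String getDisplayString(String[] children) {
            return "mask_value(" + children[0] + ")";
        }
    }

After packaging the class into a JAR, ADD JAR followed by CREATE TEMPORARY FUNCTION mask_value AS '<package>.MaskValueUDF' makes it callable from Hive queries.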
Confidential, MD
Big Data Engineer
Responsibilities:
- Planned, designed and launched a solution for building a Hadoop cluster in the cloud using AWS EMR and EC2.
- Converted MapReduce jobs into transformations and actions using Spark RDDs, DataFrames and Datasets.
- Responsible for writing Apache Pig scripts and Hive queries for data quality analysis.
- Used Flume to ingest data from many sources into the Hadoop Distributed File System (HDFS).
- Migrated the required data from MySQL into HDFS using Sqoop and imported various formats of unstructured log data into HDFS using Flume.
- Collected and aggregated large amounts of log data using Apache Flume and stored it in HDFS for future analysis.
- Developed the core search component using Apache Solr.
- Installed Solr and configured it for indexing near-real-time data.
- Developed Spark-Cassandra connector to load data to and from Cassandra.
- Worked with CQL to execute queries on data persisting in the Cassandra cluster.
- Designed Spark applications in Scala and Python to interact with data stored in HDFS using SQLContext and access Hive tables using HiveContext.
- Used Impala query engine to write queries to get faster results.
- Defined job workflows as per dependencies in Oozie.
- Developed a warehouse-specific data lake using Hive and Pig scripting, with Talend ETL pipelines populating the data marts for user/business consumption using Hive/Impala and Spark.
- Experience in managing and reviewing Hadoop log files.
- Migrated historical data from existing warehouses to Hadoop using Sqoop for scalable processing, with the eventual insights exported back via Sqoop.
- Worked on Talend to run ETL jobs on the data in HDFS.
- Built services, deployed models, algorithms, performed model training and provided tools to make our infrastructure more accessible.
- Wrote shell scripts to monitor the health check of Hadoop daemon services and respond accordingly to any warning or failure conditions.
- Responsible for Linux System Administration, DevOps, AWS Cloud platform and its features.
- Implemented Elasticsearch to decrease query times and increase search capabilities.
- Extensively used S3 to store data and deployed EC2 instances using Elastic MapReduce (EMR) to perform analysis.
- Configured a Virtual Private Cloud (VPC) with separate subnets so that different teams could deploy their own clusters and scale the number of instances up or down as needed.
- Supported data analysis projects using Elastic MapReduce on the Amazon Web Services (AWS) cloud.
- Used Apache NiFi to ingest data from IBM MQ message queues.
- Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
- Used Apache NiFi to copy data from the local file system to HDP.
- Scheduled data loading from multiple sources into Redshift using Kinesis Streams.
- Used COPY and UNLOAD to move data between the Redshift database, on-premises systems and AWS (sketched briefly at the end of this section).
- Designed an Elastic Load Balancer (ELB) and launched it in subnets to distribute network traffic across multiple instances.
- Supported the Redshift database using the STL, SVL, STV and SVV system tables/views, unloaded data to S3/on-premises, copied from PostgreSQL and scheduled ELT from multiple sources using Kinesis Streams.
- Worked on several AWS services, including EC2, ELB, VPC, S3, CloudFront, IAM, RDS, Route 53, CloudWatch, Redshift, SNS, SQS, SES and Lambda, to name a few.
Environment: Cloudera, HDFS, Hive, HQL scripts, MapReduce, Java, Cassandra, Pig, Sqoop, Kafka, Impala, Shell Scripts, Python Scripts, Spark, Scala, Oozie.
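A minimal sketch of the Redshift COPY/UNLOAD pattern mentioned above, issued over a plain JDBC connection (assuming the Redshift JDBC driver is on the classpath); the endpoint, schema, table, bucket and IAM role are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class RedshiftCopyUnloadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster endpoint, database and credentials.
            String url = "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
            Properties props = new Properties();
            props.setProperty("user", "etl_user");
            props.setProperty("password", System.getenv("REDSHIFT_PASSWORD"));

            try (Connection conn = DriverManager.getConnection(url, props);
                 Statement stmt = conn.createStatement()) {

                // Load staged files from S3 into a Redshift table.
                stmt.execute(
                    "COPY analytics.page_views " +
                    "FROM 's3://example-bucket/staging/page_views/' " +
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role' " +
                    "FORMAT AS CSV IGNOREHEADER 1");

                // Export query results back to S3 for downstream/on-premises consumers.
                stmt.execute(
                    "UNLOAD ('SELECT * FROM analytics.page_views WHERE view_date = CURRENT_DATE') " +
                    "TO 's3://example-bucket/exports/page_views_' " +
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role' " +
                    "ALLOWOVERWRITE");
            }
        }
    }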
Confidential, TX
Hadoop Developer
Responsibilities:
- Involved in implementing Hadoop Cluster and data integration in developing large-scale system software.
- Worked on analyzing the Hadoop distribution (HDP) and different big data analytics tools.
- Worked on the ORC file format, bucketing and partitioning for Hive performance enhancement and storage improvement.
- Setup security using Kerberos and AD on Hortonworks cluster.
- Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks and performance analysis.
- Worked extensively with Sqoop for importing data.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Extensively used Pig for data cleansing.
- Collected and aggregated large amounts of log data using Apache Flume and staged it in HDFS for further analysis.
- Created Hive queries to process large sets of structured, semi-structured and unstructured data, stored the results in managed and external tables, and created partitioned tables.
- Experience with SequenceFile, Avro, ORC and Parquet file formats and gzip, Snappy and bz2 compression codecs.
- Developed Pig scripts to convert data from text files to Avro format.
- Performed upgrades and configuration changes, and commissioned/decommissioned nodes as needed.
- Evaluated usage of Oozie for Workflow Orchestration.
- Supported and monitored MapReduce programs running on the cluster and provided production support (a representative job is sketched at the end of this section).
- Used Oozie to fetch data on a periodic, scheduled basis.
- Managed Hadoop operations on a multi-node HDFS cluster using Cloudera Manager.
- Involved in the ETL transformation of OLTP data into the data warehouse, implementing all transformations using SSIS and SQL commands.
- Created SSIS packages to extract data from OLTP to OLAP systems and scheduled jobs to call the packages and stored procedures.
- Used Maven extensively to build JAR files for MapReduce programs and deployed them to the cluster.
- Processed data in the Hive tables using high-performance, low-latency HQL queries.
Environment: Hadoop, HDFS, MapReduce, YARN, Hive, Pig, Oozie, Sqoop, HBase, Flume, Linux, Shell scripting, Java, Eclipse, SQL
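A representative sketch of a Maven-built MapReduce program of the kind referenced above, counting events per type and reusing the reducer as a combiner to cut shuffle volume; the tab-delimited input layout and field position are assumptions for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EventCount {

        /** Emits (eventType, 1) for each tab-delimited log line; the field layout is an assumption. */
        public static class EventMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text eventType = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length > 1) {
                    eventType.set(fields[1]);
                    context.write(eventType, ONE);
                }
            }
        }

        /** Sums counts per event type; also used as the combiner to reduce shuffle volume. */
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "event-count");
            job.setJarByClass(EventCount.class);
            job.setMapperClass(EventMapper.class);
            job.setCombinerClass(SumReducer.class);   // combiner pre-aggregates map output
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }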
Confidential
Java Developer
Responsibilities:
- Involved in requirements gathering and analysis from the existing system. Captured requirements using Use Cases and Sequence Diagrams.
- Designed physical and logical data model and data flow diagrams.
- Analyzed and modified existing code where required; responsible for gathering, documenting and maintaining business and system requirements and developing design documents.
- Developed Enterprise Java Beans (Session Beans) to perform middleware services and interact with DAO layer to perform database operations like update, retrieve, insert and delete.
- Implemented Ant and Maven build tools to build jar and war files and deployed war files to target servers.
- Used the Rally tool for Agile lifecycle management: creating stories, updating tasks and reporting bugs.
- Involved in schema design and XML page implementation.
- Developed message-driven bean components with WebSphere MQ for e-mailing and data transfer between the client and providers (a brief sketch appears at the end of this section).
- Created business classes depending upon the requirements.
- Developed web page interfaces such as user registration and login, with access control for registered users, using HTML, CSS and JavaScript/AJAX.
- Analyzed data using complex SQL queries, across various databases.
- Gathered requirements as part of the development effort.
- Performed GitHub/GitHub-Desktop bash and terminal commands to clone, fetch, merge and push the code and created pull requests for changes that are made.
- Involved in database design, writing DDL and DML scripts.
- Created several exception classes to catch errors and logged the whole process using Log4j, which makes it possible to pinpoint errors.
- Used a DB2 database to store the system data.
- Involved in creating database objects such as views, tables and procedures.
- Extensively used advanced features of PL/SQL such as records, tables, ref cursors, object types and dynamic SQL.
- Developed, implemented and unit tested the application.
Environment: Java, J2EE, Eclipse, WebLogic Application Server, Oracle, JSP, HTML, JavaScript, JMS, Servlets, UML, XML, Struts, Web Services, WSDL, SOAP, UDDI, ANT, JUnit, Log4j.
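A brief sketch of a message-driven bean of the kind referenced above; the queue name, mail handoff and EJB 3 annotation style are illustrative assumptions (the original work may well have used EJB 2.x deployment descriptors), and the binding to the actual WebSphere MQ queue would be application-server configuration.

    import javax.ejb.ActivationConfigProperty;
    import javax.ejb.MessageDriven;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    /**
     * Illustrative message-driven bean: consumes notification messages from a queue
     * (backed by WebSphere MQ in the original setup) and hands them to a mail sender.
     */
    @MessageDriven(mappedName = "jms/NotificationQueue", activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue")
    })
    public class NotificationMDB implements MessageListener {

        @Override
        public void onMessage(Message message) {
            try {
                if (message instanceof TextMessage) {
                    String body = ((TextMessage) message).getText();
                    // Hypothetical helper; the real e-mail integration is not shown here.
                    MailSender.send("noreply@example.com", body);
                }
            } catch (JMSException e) {
                throw new RuntimeException("Failed to process notification message", e);
            }
        }
    }

    /** Stand-in for the e-mail component; a real implementation would use JavaMail or JMS. */
    class MailSender {
        static void send(String from, String body) {
            System.out.println("Sending mail from " + from + ": " + body);
        }
    }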
Confidential
Java Developer
Responsibilities:
- Used WebLogic Portal 9.2 for portal development and WebLogic 8.1 for data services programming.
- Involved in gathering requirements from business users.
- Experience in the design and development of database systems using relational database management systems including Oracle, MS SQL Server and MySQL.
- Upgraded WebLogic servers in the development, testing and production environments, applying patches and service packs.
- Worked on creating EJBs that implement business logic.
- WebLogic Administration, Monitoring and Troubleshooting using Admin Console and JMX and monitoring server health and service packs.
- Involved in designing and development of the e-commerce site using JSP, Servlet, EJBs, JavaScript and JDBC.
- Worked with data migration team, providing the mapping between the source and target systems.
- Validated all forms using struts validation framework and implemented Tiles framework in the presentation layer.
- Developed the Web Interface using Struts, JavaScript, HTML and CSS.
- Developed JSP pages with Struts and EJB for implementing different search pages for transaction of each module.
- Identified and implemented user actions (Struts Action classes) and forms (Struts ActionForm classes) as part of the Struts framework (a brief sketch appears at the end of this section).
- Involved in the design and coding of the data capture templates, presentation and component templates.
- Designed intermediate database tables as per technical specifications.
- Created the web front end using JSP pages integrating AJAX and JavaScript to provide a rich browser-based user interface.
- Implemented the database using SQL Server.
- Fixed bugs in the application reported by the testing team during integration.
- Designed tables and indexes.
- Developed PL/SQL packages, procedures, functions to migrate the data from source to stage and stage to the targeting systems.
- Wrote SQL queries, stored procedures, and triggers to perform back-end database operations by using SQL Server 2005.
- Responsible for performing code reviews.
Environment: Java, J2EE, Eclipse, WebLogic Application Server, Oracle, JSP, HTML, JavaScript, JMS, Servlets, UML, XML, Struts, WSDL, SOAP, UDDI, ANT, JUnit, Log4j.
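A brief sketch of a Struts 1 action and form bean of the kind referenced above; the ProductSearchAction/ProductSearchForm names, the keyword property and the "success" forward are hypothetical, and the forward itself would be defined in struts-config.xml.

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.apache.struts.action.Action;
    import org.apache.struts.action.ActionForm;
    import org.apache.struts.action.ActionForward;
    import org.apache.struts.action.ActionMapping;

    /** Illustrative Struts 1 action: reads a form bean, runs a lookup and forwards to a JSP. */
    public class ProductSearchAction extends Action {

        @Override
        public ActionForward execute(ActionMapping mapping, ActionForm form,
                                     HttpServletRequest request, HttpServletResponse response)
                throws Exception {
            ProductSearchForm searchForm = (ProductSearchForm) form;
            // Hypothetical lookup; a real action would delegate to an EJB or DAO layer.
            request.setAttribute("results", "Results for: " + searchForm.getKeyword());
            // "success" maps to a JSP result page in struts-config.xml.
            return mapping.findForward("success");
        }
    }

    /** Matching form bean, populated by Struts from the search page's request parameters. */
    class ProductSearchForm extends ActionForm {
        private String keyword;

        public String getKeyword() {
            return keyword;
        }

        public void setKeyword(String keyword) {
            this.keyword = keyword;
        }
    }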
