Sr Big Data Engineer Resume
Charlotte, NC
SUMMARY
- 8+ years of professional IT experience in project development, implementation, deployment and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytical solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala and HBase.
- 5+ years of experience in Hadoop components like MapReduce, Flume, Kafka, Pig, Hive, Spark, HBase, Oozie, Sqoop and Zookeeper.
- Experience in working with different Hadoop distributions like CDH and Hortonworks.
- Good knowledge of the MapR distribution and Amazon EMR.
- Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Experienced in developing Pig Latin and HiveQL scripts for data analysis and ETL purposes, and extended the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing.
- Strong knowledge of the architecture of distributed systems and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Good experience in creating data ingestion pipelines, data transformations, data management, data governance and real time streaming at an enterprise level.
- Solid experience using various file formats like CSV, TSV, Parquet, ORC, JSON and Avro.
- Experienced working with data lakes, i.e. repositories of data stored in its natural/raw format, usually as object blobs or files.
- Excellent understanding of Data Ingestion, Transformation and Filtering.
- Designed and developed solutions using C#, Web API and Microsoft Azure techniques.
- Coordinated with the Machine Learning team to perform data visualization using Cognos TM1, Power BI, Tableau and QlikView.
- Developed Spark and Scala applications for performing event enrichment, data aggregation, data processing and de-normalization for different stakeholders.
- Designed new data pipelines and made the existing data pipelines more efficient.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
- In-depth understanding of Hadoop architecture and its various components such as YARN, Resource Manager, Application Master, Name Node, Data Node and HBase design principles.
- Experienced in developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
- Experienced in migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.
- Experienced in job workflow scheduling and monitoring tools like Oozie, with good knowledge of Adobe Analytics and of Zookeeper for coordinating the servers in clusters and maintaining data consistency.
- Profound understanding of Partitions and Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
- Worked on NoSQL databases like HBase, Cassandra and MongoDB.
- Experienced in performing CRUD operations using the HBase Java client API and the Solr API.
- Good experience in working with cloud environments like Amazon Web Services (AWS) EC2 and S3.
- Experienced in implementing Continuous Delivery pipelines with Maven, Ant, Jenkins, AWS and GCP. Experienced in Hadoop 2.6.4 and Hadoop 3.1.5.
- Experienced writing Shell scripts in Linux OS and integrating them with other solutions.
- Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.
- Experienced in using PL/SQL to write Stored Procedures, Functions and Triggers.
- Hands-on experience fetching live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
- Good experience creating data pipelines in Spark using Scala; a minimal sketch follows this list.
- Experienced in developing Spark programs for batch and real-time processing. Developed Spark Streaming applications for real-time processing.
- Good experience on Spark components like Spark SQL, MLlib, Spark Streaming and GraphX.
- Expertise in integrating data from multiple data sources using Kafka.
- Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing with Storm topologies.
- Deployed Kafka and integrated it with Oracle databases.
- Experienced in data processing tasks such as collecting, modeling, aggregating and moving data from various sources using Apache Kafka.
- Experienced in moving data from Hive tables into Cassandra for real-time analytics and in using Cassandra Query Language (CQL) to perform analytics on time-series data.
- Good knowledge of custom UDFs in Hive and Pig for data filtering, and of Google Analytics.
- Experienced in Apache NiFi, a Hadoop-ecosystem technology, and in integrating Apache NiFi with Apache Kafka.
- Hands-on experience in configuring and working with Flume to load data from multiple sources directly into HDFS.
- Excellent communication, interpersonal and analytical skills. Also, a highly motivated team player with the ability to work independently.
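A minimal sketch, in Scala, of the kind of Kafka-to-HDFS streaming pipeline described above. It is illustrative only: the broker address, topic name and HDFS paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object KafkaIngestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-ingest-sketch").getOrCreate()
        import spark.implicits._

        // Read a Kafka topic as a streaming DataFrame (broker and topic names are placeholders)
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events_topic")
          .load()
          .selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v", "timestamp")

        // Simple enrichment/aggregation: count events per key in 5-minute windows
        val counts = events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window($"timestamp", "5 minutes"), $"k")
          .count()

        // Land results on HDFS as Parquet; a production pipeline might target HBase or a warehouse instead
        counts.writeStream
          .outputMode("append")
          .format("parquet")
          .option("path", "/data/landing/event_counts")
          .option("checkpointLocation", "/data/checkpoints/event_counts")
          .start()
          .awaitTermination()
      }
    }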
TECHNICAL SKILLS
Big Data Space: Hadoop, MapReduce, Pig, Hive, HBASE, YARN, Kafka, Spark, PySpark, Spark SQL, HSQL, Flume, Sqoop, Impala, Oozie, Zookeeper, Ambari, NiFi, Azure, Elastic Search, Solr, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy, AWS, GCP, Airflow, Docker, Scrum, Snowflake
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, Amazon EMR
Databases & warehouses: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, Teradata, Sybase IQ, Sybase ASE, HDFS, MemSQL.
Java Space: Core Java, J2EE, JDBC, JNDI, JSP, EJB, Struts, Spring Boot, REST, SOAP, JMS
Languages: Python, Java, JRuby, SQL, PL/SQL, Scala, JavaScript, C#, R, Shell Scripts, C/C++, Go
Web Technologies: HTML, CSS, BOOTSTRAP, JavaScript, AJAX, JSP, DOM, XML, XSLT
IDE: Eclipse, NetBeans, JDeveloper, IntelliJ IDEA, Android Studio, Visual Studio, Sublime Text 3, Anaconda, Jupyter Notebook, PyCharm.
Operating systems: UNIX, Linux, Mac OS, Windows variants
RDBMS: Teradata, Oracle 9i/10g/11i, MS SQL Server, MySQL, DB2.
Version controls: GIT, SVN, CVS
ETL Tools: Informatica, AB Initio, Talend
Reporting: Cognos TM1, Tableau, SAP BO, SAP HANA, Power BI, Looker
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Sr Big Data Engineer
Responsibilities:
- Worked in a multi-clustered Hadoop ecosystem environment.
- Created MapReduce programs using the Java API that filter out unnecessary records and identify unique records based on different criteria.
- Optimized MapReduce programs using combiners, partitioners and custom counters to deliver the best results.
- Experienced in converting the existing relational database model to the Hadoop ecosystem.
- Installed and configured Apache Hadoop, Hive and Pig environment.
- Worked with Linux systems and RDBMS database on a regular basis so that data can be ingested using Sqoop.
- Reviewed and managed all log files using HBase.
- Implemented Spark jobs in Scala and PySpark using DataFrames, RDDs, Datasets and Spark SQL for data processing.
- Created Hive tables and worked on them using HiveQL; a minimal sketch follows this list.
- Used Apache Kafka for data ingestion from multiple internal clients.
- Developed data pipeline using Flume and Spark to store data into HDFS.
- Performed big data processing using Spark, AWS and Redshift.
- Involved in the process of data acquisition, data pre-processing and data exploration of a telecommunication project in Spark.
- Involved in performing linear regression using Spark MLlib in Scala.
- Continuously monitored and managed the Hadoop cluster through HDP (Hortonworks Data Platform).
- Worked with Azure Databricks to develop PySpark and Scala notebooks for Spark transformations.
- Loaded the CDRs from the relational DB using Sqoop and from other sources into the Hadoop cluster using Flume.
- Implemented data quality checks and transformations using Flume Interceptor.
- Implemented collections and the aggregation framework in MongoDB.
- Experienced in processing large volumes of data and in parallel execution of processes using Talend functionality.
- Efficiently handled periodic exporting of SQL data into Elasticsearch.
- Involved in loading data from UNIX file system and FTP to HDFS.
- Designed and Implemented Batch jobs using MR2, PIG, Hive, Tez.
- Used Apache Tez for highly optimized data processing.
- Developed Hive queries to analyze the output data.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS.
- Developed custom Pig UDFs for custom input formats, performing various levels of optimization.
- Involved in maintaining the Hadoop clusters using the Nagios server.
- Used Pig to import semi-structured data coming from Avro files to make serialization faster.
- Configured high-availability multi-core Solr servers using replication, request handlers, analyzers and tokenizers.
- Configured the Solr server to index different content types like HTML, PDF, XML, XLS, DOC, DOCX and other types. Utilized the Agile Scrum methodology.
- Loaded data into HBase using bulk and non-bulk loads.
- Used Spark for fast processing of data in Hive and HDFS.
- Performed batch processing of data sources using Apache Spark and Elasticsearch.
- Used Zookeeper to provide coordination services to the cluster.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with reference tables and historical metrics.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond accordingly to any warning or failure conditions.
- Worked on Reporting tools like Tableau to connect with Hive for generating daily reports.
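A minimal sketch, in Scala, of the Hive table work referenced above: creating and querying a partitioned external table through Spark SQL with Hive support. The table name, columns and HDFS location are hypothetical placeholders.

    import org.apache.spark.sql.SparkSession

    object HivePartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-partition-sketch")
          .enableHiveSupport()          // requires a Hive metastore on the cluster
          .getOrCreate()

        // External, partitioned table over data already landed on HDFS (path and columns are illustrative)
        spark.sql("""
          CREATE EXTERNAL TABLE IF NOT EXISTS cdr_raw (
            caller_id STRING,
            callee_id STRING,
            duration_sec INT
          )
          PARTITIONED BY (call_date STRING)
          STORED AS PARQUET
          LOCATION '/data/cdr/raw'
        """)

        // Register newly landed partitions, then query only the slice needed (partition pruning)
        spark.sql("MSCK REPAIR TABLE cdr_raw")
        val daily = spark.sql("""
          SELECT caller_id, SUM(duration_sec) AS total_sec
          FROM cdr_raw
          WHERE call_date = '2021-06-01'
          GROUP BY caller_id
        """)
        daily.show(20)

        spark.stop()
      }
    }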
Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Scala, PySpark, Spark, Kafka, Flume, Sqoop, Hortonworks, AWS, Redshift, Oozie, Zookeeper, Elasticsearch, Avro, Python, Shell Scripting, SQL, Talend, HBase, MongoDB, Linux, Solr, Ambari.
Confidential, Long Beach, CA
Sr Data Engineer
Responsibilities:
- Experienced in development using Cloudera distribution system.
- Experienced working on the Snowflake data warehouse.
- Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data.
- Designed a custom Spark REPL application to handle similar datasets and marketing datasets.
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files.
- Used AWS services like EC2 and S3 for small data sets.
- Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic Map Reduce.
- Developed several classes using C# and experienced in creating assemblies and namespaces.
- Extensively used PySpark to implement transformations and deployed in Azure HDInsight for ingestion and Hygiene, Identity Resolution process.
- Worked on analyzing the Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Spark and Sqoop.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
- Used Scala to write code for all Spark use cases. Implemented PySpark jobs for Batch Analysis.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Assigned a name to each of the columns using the case class option in Scala.
- Developed multiple Spark SQL jobs for data cleaning; a minimal sketch follows this list.
- Created Hive tables and worked on them using HiveQL.
- Assisted in loading large sets of data (structured, semi-structured and unstructured) to HDFS.
- Developed Spark SQL to load tables into HDFS and run select queries on top of them.
- Developed analytical components using Scala, Spark and Spark Streaming.
- Used visualization tools such as Power View for Excel, Tableau and Looker for visualizing and generating reports.
- Worked on the NoSQL databases HBase and MongoDB.
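A minimal sketch, in Scala, of the kind of Spark SQL data-cleaning job referenced above; the bucket paths and column names are hypothetical placeholders.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CleaningJobSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cleaning-job-sketch").getOrCreate()

        // Raw input on S3/HDFS (path and columns are placeholders)
        val raw = spark.read
          .option("header", "true")
          .csv("s3a://example-bucket/raw/customers/")

        // Typical cleaning steps: trim and normalise strings, parse dates, drop duplicates and bad rows
        val cleaned = raw
          .withColumn("email", lower(trim(col("email"))))
          .withColumn("signup_date", to_date(col("signup_date"), "yyyy-MM-dd"))
          .filter(col("customer_id").isNotNull)
          .dropDuplicates("customer_id")

        // Write back in a columnar format for downstream analytics
        cleaned.write.mode("overwrite").parquet("s3a://example-bucket/clean/customers/")

        spark.stop()
      }
    }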
Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Oracle 11g/10g, Zookeeper, MySQL, Spark, PySpark, EasyMock, SAS, SPSS, Azure, ADLS, ADF, Delta Lake, Databricks, BODS, AWS, Python, C#
Confidential, Dearborn, MI
Data Engineer
Responsibilities:
- Using Sqoop, imported and exported data from Oracle and PostgreSQL into HDFS to use it for analysis.
- Migrated Existing MapReduce programs to Spark Models using Python.
- Migrated the data from the Data Lake (Hive) into an S3 bucket.
- Performed data validation between the data present in the Data Lake and the S3 bucket.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Designed batch processing jobs using Apache Spark to increase speeds by ten-fold compared to that of MR jobs.
- Used Kafka for real time data ingestion.
- Created different topics for reading the data in Kafka.
- Read data from different topics in Kafka.
- Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.
- Developed unit test cases for the Ultraviolet project for MSIT using C# and Visual Studio.
- Migrated an existing on-premises application to AWS.
- Developed Pig Latin scripts to extract data from the web server output files and load it into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Created many Spark UDFs and UDAFs in Hive for functions that were not pre-existing in Hive and Spark SQL; a minimal sketch follows this list.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, doing map-side joins, etc.
- Good knowledge of Spark platform parameters like memory, cores and executors.
- Provided concurrent access to Hive tables with shared and exclusive locking by using the Zookeeper implementation in the cluster.
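A minimal sketch, in Scala, of registering a Spark UDF for use from Spark SQL, as referenced above; the normalize_phone helper and the sample data are hypothetical.

    import org.apache.spark.sql.SparkSession

    object UdfSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()
        import spark.implicits._

        // Hypothetical helper: normalise free-form phone numbers to digits only
        spark.udf.register("normalize_phone", (raw: String) =>
          Option(raw).map(_.replaceAll("[^0-9]", "")).orNull)

        val df = Seq(("c1", "(704) 555-0100"), ("c2", "704.555.0101")).toDF("id", "phone")
        df.createOrReplaceTempView("contacts")

        // The registered UDF is usable from Spark SQL like a built-in function
        spark.sql("SELECT id, normalize_phone(phone) AS phone_digits FROM contacts").show()

        spark.stop()
      }
    }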
Environment: Linux, Apache Hadoop Framework, JUnit, Jasmine, HDFS, YARN, Hive, HBase, AWS (S3, EMR), CloudFormation templates, CloudWatch Logs, Scala, Spark, Sqoop, AWS Kinesis
Confidential, TX
Hadoop Developer
Responsibilities:
- Participated in all the phases of the Software Development Life Cycle (SDLC), which includes development, testing, acceptance testing, implementation and maintenance.
- As a Hadoop Developer, my responsibility was managing the data pipelines and the data lake.
- Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Involved in loading data from UNIX file system to HDFS.
- Installed and configured Hive and written Hive UDFs.
- Transferred data using the Informatica tool from AWS S3 to AWS Redshift.
- Used Cassandra CQL and Java APIs to retrieve data from Cassandra tables.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources; a minimal sketch follows this list.
- Worked hands on with ETL process using Informatica.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Extracted data from Teradata utilities, including BTEQ and FastExport, into HDFS using Sqoop.
- Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior.
- Exported the analyzed patterns back into Teradata with MLoad, TPump and QG utilities using Sqoop.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Designed and implemented a test environment on AWS.
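A minimal sketch, in Scala, of the Spark regex work referenced above, using Spark SQL's regexp_extract; the sample log lines and extraction patterns are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.regexp_extract

    object RegexSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("regex-sketch").getOrCreate()
        import spark.implicits._

        // Sample web-server style lines; real input would come from HDFS or Hive
        val logs = Seq(
          "10.0.0.1 - - [10/Oct/2020:13:55:36] \"GET /index.html HTTP/1.1\" 200",
          "10.0.0.2 - - [10/Oct/2020:13:55:40] \"POST /login HTTP/1.1\" 302"
        ).toDF("line")

        // Pull out the client IP, request path and status code with regexp_extract
        val parsed = logs.select(
          regexp_extract($"line", "^(\\S+)", 1).as("client_ip"),
          regexp_extract($"line", "(?:GET|POST)\\s+(\\S+)", 1).as("path"),
          regexp_extract($"line", "(\\d{3})$", 1).as("status")
        )
        parsed.show(truncate = false)

        spark.stop()
      }
    }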
Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Glue ETL, Pig Script, Cloudera, Oozie, SAP Data Services, AWS, Python, Spark, Scala.
Confidential
Data Analyst
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration and coding.
- Acted as the technical liaison between the customer and the team on all AWS technical aspects.
- Set up and managed a CDN on Amazon CloudFront to improve site performance.
- Was part of the complete project life cycle, from requirements to production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access, and SQL to write and run various queries.
- Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
- Worked with internal architects, assisting in the development of current and target-state data architectures.
- Coordinated with the business users to provide an appropriate, effective and efficient way to design new reporting based on user needs and the existing functionality.
- Remained knowledgeable in all areas of business operations to identify system needs and requirements.
Environment: SQL, SQL Server, MS Office, and MS Visio, AWS